Abstract

  • Dataset: CREMA-D, six emotions (neutral, happy, sad, angry, fear, disgust)
  • Features: librosa, standardized, PCA (98% variance)
  • Models: SVM, Random Forest, MLP
  • MLP achieved macro F1-score: 0.5534

Introduction

  • Automatic emotion classification is challenging
  • CREMA-D: 7,442 clips, 91 actors
  • Features: MFCCs, spectral properties
  • Goal: Compare traditional, ensemble, and neural-network classifiers

Research Question

  • Q1. What is the classification accuracy of supervised methods for emotion recognition from acoustic features?
  • Q2. How do standalone algorithms, ensemble approaches, and neural networks compare in terms of accuracy, robustness, and computational efficiency for emotion classification from audio?

Exploratory Analysis

EDA

Code
# number of variables and observations in the data
print(f"Total observations: {df.shape[0]}")
print(f"Number of features: {df.shape[1]}")

# missing values in each column
missing_df = df.isna().sum()
print("Missing values per column:\n", missing_df[missing_df > 0])

# summary statistics for the numeric features
print("Numeric Description:\n", df.describe())
Total observations: 7442
Number of features: 41
Numeric Description:
           actor_id  audio_duration  sample_rate  mfcc_1_mean   mfcc_1_std  \
count  7442.000000     7442.000000       7442.0  7442.000000  7442.000000   
mean   1046.084117        2.542910      22050.0  -387.893237    81.152242   
std      26.243152        0.505980          0.0    56.912883    30.241790   
min    1001.000000        1.267982      22050.0 -1131.370700     0.000122   
25%    1023.000000        2.202222      22050.0  -428.004615    58.292865   
50%    1046.000000        2.502540      22050.0  -399.767640    76.243485   
75%    1069.000000        2.836190      22050.0  -354.018697   101.771385   
max    1091.000000        5.005034      22050.0  -162.543350   179.528880   

       mfcc_2_mean   mfcc_2_std  mfcc_3_mean   mfcc_3_std  mfcc_4_mean  ...  \
count  7442.000000  7442.000000  7442.000000  7442.000000  7442.000000  ...   
mean    131.246557    26.349166     7.226425    31.621393    50.164769  ...   
std      15.557340     6.193413    11.605281    11.700738    11.128262  ...   
min       0.000000     0.000000   -52.340374     0.000000     0.000000  ...   
25%     122.297443    22.112123     0.674029    22.746978    42.658050  ...   
50%     134.065410    25.892035     8.938592    30.019723    50.710929  ...   
75%     142.583675    30.226819    15.246822    39.455707    58.216644  ...   
max     167.168330    63.146930    38.951794    73.551254    83.296500  ...   

       mfcc_13_std  spectral_centroid_mean  spectral_centroid_std  \
count  7442.000000             7442.000000            7442.000000   
mean      6.486291             1391.389433             569.861009   
std       1.678969              254.030203             284.309588   
min       0.000000                0.000000               0.000000   
25%       5.387114             1213.176905             356.601232   
50%       6.221665             1335.982229             507.373775   
75%       7.244229             1510.743971             725.217080   
max      24.734776             2873.927831            1699.906329   

       spectral_rolloff_mean  spectral_bandwidth_mean     rms_mean  \
count            7442.000000              7442.000000  7442.000000   
mean             2959.971329              1748.984424     0.027548   
std               471.242990               115.947135     0.028312   
min                 0.000000                 0.000000     0.000000   
25%              2648.811001              1679.011486     0.010933   
50%              2926.667949              1748.878905     0.016707   
75%              3211.563802              1815.428594     0.032007   
max              5258.158543              2163.024688     0.223023   

           rms_std     zcr_mean  chroma_mean   chroma_std  
count  7442.000000  7442.000000  7442.000000  7442.000000  
mean      0.027249     0.063343     0.389487     0.301066  
std       0.031667     0.023750     0.045327     0.012338  
min       0.000000     0.000000     0.000000     0.000000  
25%       0.008040     0.047160     0.359097     0.293473  
50%       0.015029     0.056566     0.389878     0.301533  
75%       0.033126     0.072023     0.420108     0.309572  
max       0.220164     0.233774     0.553840     0.335121  

[8 rows x 38 columns]

Target Class Count Plot

Data Preprocessing

  • Remove irrelevant columns, handle missing values
  • Apply Yeo-Johnson Power Transformation for numeric skew
  • Encode target labels
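The three steps above can be sketched with scikit-learn; this is a minimal illustration, assuming the extracted features sit in a DataFrame `df` with an `emotion` label column (the toy data and column names here are stand-ins, not the actual CREMA-D table):

```python
import pandas as pd
from sklearn.preprocessing import PowerTransformer, LabelEncoder

# Toy frame standing in for the extracted-feature table
df = pd.DataFrame({
    "actor_id": [1001, 1002, 1003, 1004],   # irrelevant for modeling
    "rms_mean": [0.01, 0.02, 0.15, 0.03],   # right-skewed energy feature
    "zcr_mean": [0.05, 0.06, 0.20, 0.04],
    "emotion": ["neutral", "happy", "sad", "angry"],
})

# 1) Drop irrelevant columns and rows with missing values
X = df.drop(columns=["actor_id", "emotion"]).dropna()

# 2) Yeo-Johnson power transform to reduce numeric skew
#    (standardize=True by default, so outputs are also zero-mean)
pt = PowerTransformer(method="yeo-johnson")
X_t = pd.DataFrame(pt.fit_transform(X), columns=X.columns)

# 3) Encode the target labels as integers
le = LabelEncoder()
y = le.fit_transform(df.loc[X.index, "emotion"])
```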

Skewness Before & After Yeo-Johnson Transformation

Feature engineering

  • Spectral Contrast: Measures amplitude differences between spectral peaks and valleys, capturing timbral characteristics that distinguish emotional expressions
  • MFCCs (Mel-frequency cepstral coefficients): Extract 13 coefficients representing the short-term power spectrum, fundamental for speech emotion recognition
  • Chroma Features: Capture pitch class energy distribution, providing harmonic content information relevant to emotional prosody
  • Zero-Crossing Rate: Quantifies signal noisiness by measuring zero-axis crossings, distinguishing between voiced and unvoiced speech segments
  • Root Mean Square (RMS) Energy: Measures overall signal energy, correlating with loudness and emotional intensity

Principal Component Analysis

PCA: retain 98% variance
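With scikit-learn, passing a float to `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction; a sketch on random stand-in data (the 41-feature matrix here is synthetic, not the actual dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 41))           # stand-in for the 41-feature matrix

# Standardize first so no single feature dominates the variance
X_std = StandardScaler().fit_transform(X)

# Retain the fewest components explaining >= 98% of the variance
pca = PCA(n_components=0.98)
X_pca = pca.fit_transform(X_std)
print(X_pca.shape, pca.explained_variance_ratio_.sum())
```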

Model Training & Evaluation

Model Evaluation Function

  • Inputs:
    • model → ML model instance (Random Forest, SVM, MLP)
    • X_train, X_test → feature matrices
    • y_train, y_test → labels
    • model_name → string for labeling outputs
  • Output:
    • Console print of metrics & confusion matrix
    • Dictionary with overall and per-class performance
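A minimal sketch of such a helper, assuming scikit-learn metrics; the function name, the toy six-class data, and the dictionary layout are illustrative stand-ins for the project's actual implementation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)

def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """Fit the model, print metrics + confusion matrix, return a results dict."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"{model_name} accuracy: {acc:.3f}")
    print(confusion_matrix(y_test, y_pred))
    return {
        "model": model_name,
        "accuracy": acc,
        # per-class precision/recall/F1, keyed by class label
        "per_class": classification_report(y_test, y_pred, output_dict=True),
    }

# Toy six-class problem in place of the PCA-reduced CREMA-D features
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
results = evaluate_model(RandomForestClassifier(random_state=0),
                         X_tr, X_te, y_tr, y_te, "RF")
```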

Model Comparison

Model Performance

Model   Accuracy   Precision   Recall   F1-Score
RF      0.520      0.510       0.515    0.512
SVM     0.490      0.480       0.487    0.482
MLP     0.557      0.554       0.557    0.553
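A table like the one above can be assembled from any set of predictions with scikit-learn's macro-averaged metrics; a sketch using hypothetical labels and predictions (the values below are not the reported results):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical test labels and per-model predictions
y_test = [0, 1, 2, 2, 1, 0, 2, 1]
preds = {"RF":  [0, 1, 2, 1, 1, 0, 2, 2],
         "SVM": [0, 2, 2, 1, 1, 0, 1, 2]}

rows = []
for name, y_pred in preds.items():
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro", zero_division=0)
    rows.append({"Model": name,
                 "Accuracy": accuracy_score(y_test, y_pred),
                 "Precision": p, "Recall": r, "F1-Score": f1})
table = pd.DataFrame(rows).round(3)
print(table)
```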

Model Metric Comparison

Model Metric (Accuracy, Precision, Recall, F1-Score) Comparison

Model Prediction

Random Forest Predictions

Random Forest: Confusion Matrix

Support Vector Machine Predictions

Support Vector Machine: Confusion Matrix

Multi-Layer Perceptron Predictions

Multi-Layer Perceptron: Confusion Matrix

Conclusion

Summary

  • MLP is the best-performing model for multi-class emotion recognition
  • Neural networks capture complex, non-linear patterns better than traditional or ensemble methods
  • Future Work: temporal modeling, multimodal integration, advanced feature extraction